{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# HyperparameterHunter Extended Example\n",
    "\n",
    "In this example, we'll try to simulate a miniature project and go over some of the things you should expect when starting out with HyperparameterHunter and some of the things you might want to adjust."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Using TensorFlow backend.\n"
     ]
    }
   ],
   "source": [
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "# import os\n",
    "# os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'\n",
    "from sklearn.model_selection import StratifiedKFold\n",
    "from xgboost import XGBClassifier\n",
    "\n",
    "from hyperparameter_hunter import Environment, CVExperiment\n",
    "from hyperparameter_hunter.utils.learning_utils import get_breast_cancer_data\n",
    "from hyperparameter_hunter.utils.file_utils import print_tree"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Above, we're importing the cross-validation scheme with which we want to start, along with the first algorithm we'll test (`StratifiedKFold` and `XGBClassifier`, respectively). Then, from `hyperparameter_hunter`, we import `Environment`, and `CVExperiment`, which are at the core of any project. We also import the following utility functions: `get_breast_cancer_data`, which puts the result of `sklearn.datasets.load_breast_cancer` into a DataFrame; and `print_tree`, which prints out a directory's contents so we can easily see what files are being created by completed Experiments.\n",
    "\n",
    "Below, we declare the directory to store hyperparameter_hunter's results, and print out what it looks like. The astute readers will notice that nothing is printed because that directory doesn't exist yet (unless you're running this with a populated directory, in which case the rest of this example won't make much sense)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "example_assets = 'HyperparameterHunterAssets'\n",
    "print_tree(example_assets)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Environment"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Cross-Experiment Key: icnP8lYZqtpQSHW_tVVuo3Y6LPEQtYdTNMZcpfPf_cs=\n"
     ]
    }
   ],
   "source": [
    "env = Environment(\n",
    "    train_dataset=get_breast_cancer_data(),\n",
    "    results_path=example_assets,\n",
    "    metrics=['roc_auc_score'],\n",
    "    target_column='diagnosis',\n",
    "    cv_type=StratifiedKFold,\n",
    "    cv_params=dict(n_splits=3, shuffle=True, random_state=32),\n",
    "    file_blacklist=['script_backup']\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We begin by instantiating an `Environment`, and giving it the following: \n",
    "* Our `train_dataset`,\n",
    "* Our `results_path` declared above,\n",
    "* A `metrics`, with 'roc_auc_score', because the Wisconsin Breast Cancer dataset is a classification problem,\n",
    "* `target_column`, a string naming the column in our dataset containing the target output (defaults to 'target'),\n",
    "* Our `cv_type` imported earlier,\n",
    "* `cv_params`, a dict containing all the arguments to pass to `cv_type`,\n",
    "* And `file_blacklist`, a list of result file names we don't want to save. Here, we're giving it 'script_backup' because we're executing our Experiments in a Jupyter notebook, instead of a standard Python script\n",
    "\n",
    "Notice that upon instantiation, our `Environment` logs the cross-experiment key produced by the provided parameters. This is important because it determines when two `Experiment`s can be properly compared; we'll go over this more later."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### First Experiment\n",
    "Let's dive right in and execute our first `Experiment` to see what happens! All we need to do is give it a `model_initializer` and `model_init_params`, and it'll work with our `Environment` to take care of the rest!\n",
    "\n",
    "`model_init_params` can contain any arguments expected by the `__init__` method of `model_initializer`. Additionally, our Experiment will figure out the default values of any arguments we don't provide and it'll record those too! So there's no need to give it a huge dictionary of arguments just to make sure all the hyperparameters get recorded!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "<2018-06-08 16:41:42,817> Validated Environment with key: \"icnP8lYZqtpQSHW_tVVuo3Y6LPEQtYdTNMZcpfPf_cs=\"\n",
      "<2018-06-08 16:41:42,818> \n",
      "<2018-06-08 16:41:42,819> Initialized new Experiment with ID: 1d1627ac-1587-4dbd-8eb3-4611ae74505a\n",
      "<2018-06-08 16:41:42,820> Skipped creating backup of file: \"/Users/Hunter/hyperparameter_hunter/examples/<ipython-input-4-030e3e5c6c1c>\"\n",
      "<2018-06-08 16:41:42,836> Generated hyperparameter key: eGjxwq35MmEiMHtJ0ANMqvxKLFpp4ZVKUKXNgXOiumQ=\n",
      "<2018-06-08 16:41:42,841> Initial preprocessing stage complete\n",
      "<2018-06-08 16:41:42,845> \n",
      "\n",
      "<2018-06-08 16:41:42,850> Starting Repetition 0\n",
      "<2018-06-08 16:41:42,851> \n",
      "<2018-06-08 16:41:42,852> \n",
      "<2018-06-08 16:41:42,855> F0/R0  |  Seed: 10967   Time: 16:41:42\n",
      "<2018-06-08 16:41:43,039> F0/R0  |  OOF(roc_auc_score=0.95354)  |  Time Elapsed: 0.18405 s\n",
      "<2018-06-08 16:41:43,043> F0.0 AVG:   OOF(roc_auc_score=0.95354)  |  Time Elapsed: 0.19117 s\n",
      "<2018-06-08 16:41:43,046> \n",
      "<2018-06-08 16:41:43,048> F1/R0  |  Seed: 75062   Time: 16:41:43\n",
      "<2018-06-08 16:41:43,208> F1/R0  |  OOF(roc_auc_score=0.93822)  |  Time Elapsed: 0.15964 s\n",
      "<2018-06-08 16:41:43,211> F0.1 AVG:   OOF(roc_auc_score=0.93822)  |  Time Elapsed: 0.16497 s\n",
      "<2018-06-08 16:41:43,212> \n",
      "<2018-06-08 16:41:43,215> F2/R0  |  Seed: 60284   Time: 16:41:43\n",
      "<2018-06-08 16:41:43,370> F2/R0  |  OOF(roc_auc_score=0.97017)  |  Time Elapsed: 0.15577 s\n",
      "<2018-06-08 16:41:43,373> F0.2 AVG:   OOF(roc_auc_score=0.97017)  |  Time Elapsed: 0.16068 s\n",
      "<2018-06-08 16:41:43,375> \n",
      "<2018-06-08 16:41:43,376> Repetition 0 AVG:   OOF(roc_auc_score=0.95393)  |  Time Elapsed: 0.52512 s\n",
      "<2018-06-08 16:41:43,378> \n",
      "<2018-06-08 16:41:43,379> FINAL:    OOF(roc_auc_score=0.95393)  |  Time Elapsed: 0.53317 s\n",
      "<2018-06-08 16:41:43,380> \n",
      "<2018-06-08 16:41:43,397> Saving results for Experiment: \"1d1627ac-1587-4dbd-8eb3-4611ae74505a\"\n",
      "<2018-06-08 16:41:43,398> Saving result file for \"TestedKeyRecorder\"\n",
      "<2018-06-08 16:41:43,399> Saved cross_experiment_key: \"icnP8lYZqtpQSHW_tVVuo3Y6LPEQtYdTNMZcpfPf_cs=\"\n",
      "<2018-06-08 16:41:43,400> Saved hyperparameter_key: \"eGjxwq35MmEiMHtJ0ANMqvxKLFpp4ZVKUKXNgXOiumQ=\"\n",
      "<2018-06-08 16:41:43,402> Saving result file for \"LeaderboardEntryRecorder\"\n",
      "<2018-06-08 16:41:43,405> Saving result file for \"DescriptionRecorder\"\n",
      "<2018-06-08 16:41:43,407> Saving result file for \"PredictionsOOFRecorder\"\n",
      "<2018-06-08 16:41:43,412> Saving result file for \"HeartbeatRecorder\"\n"
     ]
    }
   ],
   "source": [
    "experiment_0 = CVExperiment(\n",
    "    model_initializer=XGBClassifier,\n",
    "    model_init_params=dict(objective='reg:linear', max_depth=3, n_estimators=100, subsample=0.5)\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This may look like a bit much at first, so let's unpack everything. First notice, the log noting the `Experiment`'s connection with the `Environment` we created earlier. Then, it prints out the `Experiment`'s ID, which is a randomly generated UUID. Down another line, we have the `Experiment`'s `hyperparameter_key`, which hashes the inputs to `CVExperiment` to make a \"fingerprint\" for the `Experiment` to help `hyperparameter_hunter` figure out if we're running an `Experiment` it already has results for. No more wasting time with duplicate experiments! \n",
    "\n",
    "Skipping down a few lines, we see the results of fitting our model during Stratified K-Folds cross-validation for 1 repetition, 3 folds, and 1 run, followed by the final out-of-fold ROC-AUC score for the `Experiment`. The subsequent lines describe which result files are being saved.\n",
    "\n",
    "Now, let's print out the contents of our 'HyperparameterHunterAssets' directory, and notice how dramatically the landscape has changed!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[;1mHyperparameterHunterAssets/\u001b[0m\n",
      "|-- Heartbeat.log\n",
      "\u001b[;1m|-- Experiments/\u001b[0m\n",
      "\u001b[;1m|   |-- Descriptions/\u001b[0m\n",
      "|   |   |-- 1d1627ac-1587-4dbd-8eb3-4611ae74505a.json\n",
      "\u001b[;1m|   |-- Heartbeats/\u001b[0m\n",
      "|   |   |-- 1d1627ac-1587-4dbd-8eb3-4611ae74505a.log\n",
      "\u001b[;1m|   |-- PredictionsOOF/\u001b[0m\n",
      "|   |   |-- 1d1627ac-1587-4dbd-8eb3-4611ae74505a.csv\n",
      "\u001b[;1m|-- KeyAttributeLookup/\u001b[0m\n",
      "|   |-- cv_type.db\n",
      "|   |-- model_initializer.db\n",
      "|   |-- prediction_formatter.json\n",
      "\u001b[;1m|   |-- train_dataset/\u001b[0m\n",
      "|   |   |-- XacopYu-zgxwHX7CUJpgmPg6IPRmyjSOb-nWebZim0E=.csv\n",
      "\u001b[;1m|-- Leaderboards/\u001b[0m\n",
      "|   |-- GlobalLeaderboard.csv\n",
      "\u001b[;1m|-- TestedKeys/\u001b[0m\n",
      "|   |-- icnP8lYZqtpQSHW_tVVuo3Y6LPEQtYdTNMZcpfPf_cs=.json\n"
     ]
    }
   ],
   "source": [
    "print_tree(example_assets)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "WARNING: Incoming wall of text!\n",
    "<br>\n",
    "Let's go through the new files one by one.\n",
    "\n",
    "### HyperparameterHunterAssets/\n",
    "* Contains one file (__'Heartbeat.log'__), and four subdirectories (__'Experiments/'__, __'KeyAttributeLookup/'__, __'Leaderboards/'__, and __'TestedKeys/'__).\n",
    "* __'Heartbeat.log'__ is the log file for the current/most recently executed `Experiment`. It will look very much like the above printed output of `CVExperiment`, with some additional debug messages thrown in. When the `Experiment` is completed, a copy of this file is saved as the Experiment's own Heartbeat file, which we'll see soon.\n",
    "\n",
    "#### Experiments/\n",
    "* Contains three subdirectories in this example; although, at the time of this file's writing, it can contain up to six different subdirectories. The files contained in each of the subdirectories all follow the same naming convention: they are named after the `Experiment`'s randomly-generated UUID. The subdirectories are as follows:\n",
    "    \n",
    "    ##### Descriptions/\n",
    "    Contains a .json file for each completed `Experiment`, describing all critical (and some extra) information about the Experiment's results. Such information includes, but is certainly not limited to: keys, algorithm/library name, final scores, model_initializer hash, hyperparameters, cross experiment parameters, breakdown of times elapsed, start/end datetimes, breakdown of evaluations over runs/folds/reps, source script name, platform, and additional notes. This file is meant to give you all the details you need regarding an `Experiment`'s results and what the conditions were that led to those results.\n",
    "    \n",
    "    ##### Heartbeats/\n",
    "    Contains a .log file for each completed `Experiment` that is created by copying the aforementioned __'HyperparameterHunterAssets/Heartbeat.log'__ file. This file is meant to give you a record of what exactly the `Experiment` was experiencing along the course of its existence. This can be useful if you need to verify questionable results, or check for error/warning/debug messages that might not have been noticed before.\n",
    "    \n",
    "    ##### PredictionsOOF/\n",
    "    Contains a .csv file for each completed `Experiment`, containing out-of-fold predictions for the `train_dataset` provided to `Environment`. If `Environment` is given a `runs` value > 1, or if a repeated cross-validation scheme is provided (like sklearn's `RepeatedKFold` or `RepeatedStratifiedKFold`), then OOF predictions will be averaged according to the number of runs and repetitions. An extended discussion of this file's uses probably isn't necessary, but just some of the things you might want it for include: testing the performance of ensembled models via their prediction files, or calculating other metric values, if, for example, we wanted an F1 score, or simple accuracy after the `Experiment` had finished, instead of the ROC-AUC score we told the `Environment` we wanted. Note that if we knew ahead of time we wanted all three of these metrics, we could have easily given the `Environment` all three (or any other number of metrics) at its initialization. See the 'custom_metrics_example.py' example script for more details on advanced metrics specifications.\n",
    "    \n",
    "    ##### PredictionsHoldout/ (EXTRA DIRECTORY)\n",
    "    This is the first of the three 'Experiments/' subdirectories not shown in this example. Its file structure is pretty much identical to __'PredictionsOOF/'__, and is populated when we use `Environment`'s `holdout_dataset` kwarg to provide a holdout DataFrame, a filepath to one, or a callable to extract a `holdout_dataset` from our `train_dataset`. Additionally, if a `holdout_dataset` is provided, the provided metrics will be calculated for it as well (unless you tell it otherwise).\n",
    "    \n",
    "    ##### PredictionsTest/ (EXTRA DIRECTORY)\n",
    "    The second subdirectory not shown in this example is much like __'PredictionsOOF/'__ and __'PredictionsHoldout/'__. It is populated when we use `Environment`'s `test_dataset` kwarg to provide a test DataFrame, or a filepath to one. It may be worth noting that the major difference between `test_dataset` and its counterparts (`train_dataset`, and `holdout_dataset`) is that test predictions are not evaluated because it is the nature of the `test_dataset` to have unknown targets.\n",
    "    \n",
    "    ##### ScriptBackups/ (EXTRA DIRECTORY)\n",
    "    The final subdirectory not shown (at the time of this example's writing) contains a .py file for each completed `Experiment` that is an exact copy of the script executed that led to the instantiation of the `Experiment`. These files exist primarily to assist in \"oh shit\" moments where you have no idea how to recreate an `Experiment`. 'script_backup' is blacklisted by default when executing a hyperparameter `OptimizationProtocol`, as all experiments would be created by the same file.\n",
    "\n",
    "#### KeyAttributeLookup/\n",
    "* This directory stores any complex-typed `Environment` parameters and hyperparameters, as well as the hashes with which those complex objects are associated. \n",
    "* Specifically, this directory is concerned with any python classes, or callables, or DataFrames you may provide, and will create a the appropriate file or directory to properly store the object. \n",
    "    * If a class is provided (as is the case with `cv_type`, and `model_initializer`), the Shelve and Dill libraries are used to pickle a copy of the class, linked to the class's hash as its key. \n",
    "    * If a defined function, or a lambda is provided (as is the case with `prediction_formatter`, which is an optional `Environment` kwarg), a .json file entry is created linking the callable's hash to its source code saved as a string, which can be recreated using Python's exec function.\n",
    "    * If a Pandas DataFrame is provided (as is the case with `train_dataset`, and its holdout and test counterparts), the process is slightly different. Rather than naming a file after the complex-typed attribute (as in the first two types), a directory is named after the attribute, hence the __'HyperparameterHunterAssets/KeyAttributeLookup/train_dataset/'__ directory. Then, .csv files are added to the corresponding directory, which are named after the DataFrame's hash, and which contain the DataFrame itself.\n",
    "* Entries in the __'KeyAttributeLookup/'__ directory are created on an as-needed basis. This means that you may see entries named after attributes other than those shown in this example along the course of your own project. They are created whenever `Environment`s or `Experiment`s are provided arguments too complex to neatly display in the `Experiment`'s __'Descriptions/'__ entry file. Some other complex attributes you may come across that are given __'KeyAttributeLookup/'__ entries include: custom metrics provided via `Environment`'s `metrics` and `metrics_params` kwargs, and Keras Neural Network `callbacks` and `build_fn`s.\n",
    "\n",
    "#### Leaderboards/\n",
    "* At the time of this notebook's writing, this directory contains only one file: __'GlobalLeaderboard.csv'__; although, more are on the way to assist you in comparing the performance of different `Experiment`s, and they should be similar in structure to this one. \n",
    "* __'GlobalLeaderboard.csv'__ is a DataFrame containing one row for every completed `Experiment`\n",
    "* It has a column for every final metric evaluation performed, as well as the following columns: 'experiment_id', 'hyperparameter_key', 'cross_experiment_key', and 'algorithm_name'\n",
    "* Rows are sorted in descending order according to the first metric provided, and will prioritize OOF evaluation before holdout evaluations if both are given.\n",
    "* If an `Experiment` does not have a particular evaluation, the `Experiment` row's value for that column will be null.\n",
    "    * This can happen if new metrics are specified, which were not recorded for earlier experiments, or if a `holdout_dataset` is provided to later `Experiment`s that earlier ones did not have.\n",
    "\n",
    "#### TestedKeys/\n",
    "* This directory contains a .json file named for every unique `cross_experiment_key` encountered.\n",
    "* Each .json file contains a dictionary, whose keys are the `hyperparameter_key`s that have been tested in conjunction with the `cross_experiment_key` for which the containing file is named.\n",
    "* The value of each of these keys is a list of strings, in which each string is an `experiment_id`, denoting an `Experiment` that was conducted with the hyperparameters symbolized by that list's key, and an `Environment`, whose cross-experiment parameters are symbolized by the name of the containing file.\n",
    "    * The values are lists in order to accomodate `Experiment`s that are intentionally duplicated.\n",
    "    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.1"
  },
  "varInspector": {
   "cols": {
    "lenName": 16,
    "lenType": 16,
    "lenVar": 40
   },
   "kernels_config": {
    "python": {
     "delete_cmd_postfix": "",
     "delete_cmd_prefix": "del ",
     "library": "var_list.py",
     "varRefreshCmd": "print(var_dic_list())"
    },
    "r": {
     "delete_cmd_postfix": ") ",
     "delete_cmd_prefix": "rm(",
     "library": "var_list.r",
     "varRefreshCmd": "cat(var_dic_list()) "
    }
   },
   "types_to_exclude": [
    "module",
    "function",
    "builtin_function_or_method",
    "instance",
    "_Feature"
   ],
   "window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
